feat(parser): adopt section-number module in refs/inference/text parsers #115

Merged: thewrz merged 15 commits into main from feat/section-number-parsers on Jun 6, 2026

Conversation

@thewrz thewrz commented Jun 6, 2026

Summary

Stacked on #114. Fixes the silent suffix-truncation data corruption across all content-scanning parsers by adopting sectionNumberFragment() / normalizeSectionNumber():

  • parser/refs (prose citations): a citation such as "See Section 26 00 13.10" no longer stores a wrong link to the base section 26 00 13; the suffix is captured and normalized.
  • lib/infer-section: keyword scan keeps .33 (01 33 23.33 is not 01 33 23), bare suffixed headers infer instead of returning none, inline titles after suffixed numbers are extracted again, and dash-separated inline titles now strip the dash (parity with the text parser).
  • parser/text: .txt headers keep suffixed sections AND their dash titles (previously "SECTION 27 05 13.43 - TITLE" truncated the section and lost the title).
  • parser/sec: SCN/SRF values are normalized or kept verbatim: whitespace/NBSP dirt is canonicalized and tagged values are never rejected (3 corpus files carry SCN dirt). Internal-dirt normalization is mutation-test pinned.
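
For readers without the section-number module at hand, the capture-then-normalize pattern for prose citations can be sketched roughly as follows. This is a minimal illustration, not the module's code: the fragment pattern, the helper bodies, and the `extractCitation` name are assumptions, and agency suffixes are left out.

```typescript
// Hypothetical stand-ins for the real helpers; the actual module is not shown in this PR.
// Assumed fragment: three 2-digit groups, bounded space/NBSP separators, optional .NN suffix.
const FRAGMENT = String.raw`\d{2}[ \u00A0]{1,4}\d{2}[ \u00A0]{1,4}\d{2}(?:\.\d{2})?`;

function sectionNumberFragment(): string {
  return FRAGMENT;
}

// Returns the canonical "NN NN NN[.NN]" form, or null when the input is not a section number.
function normalizeSectionNumber(raw: string): string | null {
  const m = /^\s*(\d{2})[ \u00A0]{1,4}(\d{2})[ \u00A0]{1,4}(\d{2})(\.\d{2})?\s*$/.exec(raw);
  if (!m) return null;
  return `${m[1]} ${m[2]} ${m[3]}${m[4] ?? ""}`;
}

// extractCitation is a hypothetical name: capture the whole number (suffix included)
// in one group, then normalize, falling back to the trimmed raw capture.
function extractCitation(text: string): string | null {
  const re = new RegExp(`\\bSection\\s+(${sectionNumberFragment()})`, "i");
  const m = re.exec(text);
  if (!m) return null;
  const raw = m[1] ?? "";
  return normalizeSectionNumber(raw) ?? raw.trim();
}

console.log(extractCitation("See Section 26 00 13.10 for details")); // "26 00 13.10"
console.log(extractCitation("See Section 26\u00A000\u00A013.10")); // "26 00 13.10"
console.log(normalizeSectionNumber("APPENDIX B")); // null
```

Capturing the number as a single group is what keeps the suffix: a fixed 2-2-2 pattern stops matching before the ".10" and silently yields the base section.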

Test Plan

  • pnpm test src/parser/ src/lib/infer-section.test.ts — regression tests named by symptom (e.g. 'infer-section: keyword scan keeps .33 — 01 33 23.33 is not 01 33 23')
  • pnpm lint && pnpm test — 564 unit tests green at this commit; pre-existing suffix pins (27 05 13.43) untouched

Out of Scope

This PR does NOT tighten any validation gate (API schemas/worker gate land in PR 3; DB CHECK in PR 4). Parser outputs for conforming inputs are unchanged; dirty inputs are strictly improved.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Enhanced section citation parsing to properly handle dotted suffixes (e.g., "Section 26 00 13.10") and agency-formatted section numbers.
    • Improved whitespace normalization in section references, converting NBSP separators to canonical spacing for consistent matching.
  • Tests

    • Expanded test coverage for edge cases in section number parsing and citation extraction.

coderabbitai Bot commented Jun 6, 2026

Walkthrough

This PR systematizes section-number canonicalization across the parsing and inference pipeline by importing normalizeSectionNumber and sectionNumberFragment helpers and applying them consistently during section extraction in the inference layer, text parser, SEC parser, and reference extraction modules, including corresponding test coverage.

Changes

Section-number canonicalization pipeline

| Layer / File(s) | Summary |
| --- | --- |
| Infer section layer normalization (src/lib/infer-section.ts, src/lib/infer-section.test.ts) | Build SECTION and bare-number regexes from sectionNumberFragment(). Normalize extracted sections via normalizeSectionNumber() in scanKeyword and scanBareNumber. Fix inline-title capture to use inlineMatch[2]. Tests verify decimal suffix preservation, agency-suffixed formats, inline-title extraction with dash/em-dash separators, and bare suffixed headers. |
| Text parser section normalization (src/parser/text/index.ts, src/parser/text/index.test.ts) | Build SECTION regex patterns from sectionNumberFragment(). Normalize extracted sections in extractSectionMeta() by passing first capture group through normalizeSectionNumber(), replacing multi-group assembly. Tests cover dash-delimited titles, agency-suffixed headers, bare suffixed header parsing, and verification that dashed headers are classified as metadata and not body text. |
| SEC parser section normalization (src/parser/sec/index.ts, src/parser/sec/index.test.ts) | Normalize <SCN> section numbers via normalizeSectionNumber() after entity decoding, with fallback to raw decoded value. Normalize <SRF> tag contents for targetSpecSection fields. Tests verify whitespace canonicalization for SCN variants, NBSP-to-canonical conversion for SRF targets, and preservation of unnormalizable content like APPENDIX B. |
| Reference extraction section normalization (src/parser/refs/extract.ts, src/parser/refs/extract.test.ts, src/parser/refs/rules.ts, src/parser/refs/rules.test.ts) | Normalize section numbers in buildRef by extracting from first capture group and applying normalizeSectionNumber() with fallback to trimmed raw. Update CSI section reference rule regex to use sectionNumberFragment() instead of fixed 2-2-2 numeric pattern. Tests verify dotted-suffix preservation (26 00 13.10), NBSP normalization, and agency-suffix extraction. |
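
The header handling summarized for the inference and text layers (section in the first capture group, inline title in the second, separator dash stripped) might look roughly like the sketch below. The fragment pattern, the `parseHeader` name, and the sample titles are illustrative assumptions and are not taken from the repository.

```typescript
// Assumed fragment shape; the real sectionNumberFragment() output is not shown in this PR.
const FRAGMENT = String.raw`\d{2}[ \u00A0]{1,4}\d{2}[ \u00A0]{1,4}\d{2}(?:\.\d{2})?`;

// Group 1: full section number, suffix included. Group 2: inline title, if any.
// The optional separator consumes a hyphen, en dash (\u2013) or em dash (\u2014)
// so that the dash does not leak into the title.
const HEADER_RE = new RegExp(
  String.raw`^\s*SECTION\s+(${FRAGMENT})(?:\s*[-\u2013\u2014]\s*|\s+)?(.*)$`,
  "i",
);

interface HeaderMeta {
  section: string;
  title: string | null;
}

// parseHeader is a hypothetical name for illustration only.
function parseHeader(line: string): HeaderMeta | null {
  const m = HEADER_RE.exec(line);
  if (!m) return null;
  // Canonicalize separators: any run of spaces/NBSPs becomes a single space.
  const section = (m[1] ?? "").replace(/[ \u00A0]+/g, " ");
  const title = (m[2] ?? "").trim();
  return { section, title: title.length > 0 ? title : null };
}

console.log(parseHeader("SECTION 27 05 13.43 - STRUCTURED CABLING"));
// { section: "27 05 13.43", title: "STRUCTURED CABLING" }
console.log(parseHeader("SECTION 01 33 23.33"));
// { section: "01 33 23.33", title: null }
```

Reading the title from the second group matters because the first group now holds the whole number; a pattern that assembles the number from several groups shifts every later group index.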

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • wrzonance/SpecR#62: Both PRs evolve section/title inference in the same area; #62 is the foundational work on src/lib/infer-section.ts section-number matching and normalization.
  • wrzonance/SpecR#76: Both PRs update src/parser/refs/extract.ts and src/parser/refs/rules.ts to normalize section-number construction in targetSpecSection and the CSI section reference rule.
  • wrzonance/SpecR#112: Both PRs modify src/parser/sec/index.ts's SCN and section-title parsing in parseSec, with different approaches to normalizing section numbers and handling entity decoding.

Poem

🐰 Hops through sections near and far,
Normalizes numbers, fixing what they are.
From XX XX XX to XX XX XX.YY,
Canonicalized sections reach the sky!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'feat(parser): adopt section-number module in refs/inference/text parsers' clearly and specifically summarizes the main change: the adoption of section-number module functions across multiple parser components. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

thewrz commented Jun 6, 2026

@coderabbitai review

coderabbitai Bot commented Jun 6, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lib/infer-section.ts`:
- Line 18: The ReDoS concern is a false positive because sectionNumberFragment()
returns a fixed module-local constant FRAGMENT composed of fixed-width pieces
(e.g., \d{2}, optional \.\d{2}, optional single agency token) and is
interpolated into new RegExp(...) in infer-section.ts; update the code by adding
a concise inline comment next to sectionNumberFragment() and the new RegExp(...)
call explaining that FRAGMENT is fixed-width and contains no nested
unbounded/repeating groups, and optionally add a small unit/perf test for
infer-section.ts that runs the regex against long adversarial strings to prove
it doesn't exhibit catastrophic backtracking.
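
The optional perf test suggested above could be sketched along these lines. It is a rough illustration only: the fragment shown is an assumed stand-in for the real FRAGMENT constant, and the adversarial inputs and the timing helper are made up for this example.

```typescript
// Assumed shape of the fragment: fixed-width digit groups, bounded separators,
// one optional suffix, and no nested or unbounded repetition inside the fragment.
const FRAGMENT = String.raw`\d{2}[ \u00A0]{1,4}\d{2}[ \u00A0]{1,4}\d{2}(?:\.\d{2})?`;
const KEYWORD_RE = new RegExp(String.raw`SECTION\s+(${FRAGMENT})`, "i");

// Wall-clock time for a single regex test, in milliseconds.
function elapsedMs(re: RegExp, input: string): number {
  const start = Date.now();
  re.test(input);
  return Date.now() - start;
}

// Long inputs that would punish a pattern with nested quantifiers.
const adversarialInputs: string[] = [
  "SECTION" + " ".repeat(100000) + "x", // huge whitespace run, then a non-digit
  "SECTION 12 34 5".repeat(20000), // many near-misses in a row
  "SECTION 01 33 23" + ".".repeat(100000), // long tail after a valid match
  "12 ".repeat(50000), // digits without the keyword
];

const timings = adversarialInputs.map((input) => elapsedMs(KEYWORD_RE, input));
console.log(timings);
```

In a real test the timings would be asserted against a generous bound so the check stays stable on slow CI machines.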

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: a7f86d70-a57d-4ac5-9819-058461614eac

📥 Commits

Reviewing files that changed from the base of the PR and between fb18cf3 and 0d7ecd4.

📒 Files selected for processing (10)
  • src/lib/infer-section.test.ts
  • src/lib/infer-section.ts
  • src/parser/refs/extract.test.ts
  • src/parser/refs/extract.ts
  • src/parser/refs/rules.test.ts
  • src/parser/refs/rules.ts
  • src/parser/sec/index.test.ts
  • src/parser/sec/index.ts
  • src/parser/text/index.test.ts
  • src/parser/text/index.ts

Comment thread src/lib/infer-section.ts
Base automatically changed from feat/section-number-lib to main June 6, 2026 20:46
@thewrz thewrz merged commit 3052a21 into main Jun 6, 2026
5 checks passed
@thewrz thewrz deleted the feat/section-number-parsers branch June 6, 2026 20:50